In this investigation we study the characteristics of the public bicycle network of the city of San Francisco, which helps us understand how the population moves around the city. How often is this form of transport used? What are the busiest routes? Which stations have the most activity?
The database contains the record of the bicycle trips made in the city of San Francisco during the last 10 months. It has more than 2.5 million records, which include temporal information and the georeferenced coordinates of the start and end stations of each route, allowing us to establish paths and their respective durations. Finally, it provides user data, such as the type of subscription. The CSV file with the gathered data is attached to the documentation.
import os
import requests

urls = ['https://s3.amazonaws.com/baywheels-data/202003-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/202002-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/202001-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201912-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201911-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201910-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201909-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201908-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201907-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201906-baywheels-tripdata.csv.zip',
'https://s3.amazonaws.com/baywheels-data/201905-baywheels-tripdata.csv.zip',]
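folder_name = 'data_base'
# Create the download folder if it does not exist yet
os.makedirs(folder_name, exist_ok=True)
for url in urls:
    response = requests.get(url)
    response.raise_for_status()  # stop early if a download failed
    # Save each archive under its original file name
    with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)
os.listdir(folder_name)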
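The step that turns the downloaded archives into `df_bike` is not shown in this section. As a minimal sketch of how the archives could be read, assuming each `.zip` holds one or more UTF-8 CSV files with a header row, the hypothetical helper `read_trip_archives` below uses only the standard library. With pandas, `pd.read_csv` can also read a zip archive that contains a single CSV directly, which is probably closer to what the notebook did.

```python
import csv
import io
import zipfile
from pathlib import Path


def read_trip_archives(folder):
    """Yield one dict per trip from every .zip archive found in `folder`.

    Hypothetical helper: assumes UTF-8 CSV files with a header row.
    """
    for archive_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            for member in archive.namelist():
                # Skip macOS metadata entries and anything that is not a CSV.
                if member.startswith("__MACOSX") or not member.lower().endswith(".csv"):
                    continue
                with archive.open(member) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    yield from csv.DictReader(text)
```

Every value comes back as a string, so numeric columns such as `duration_sec` still need to be converted before any analysis.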
df_bike.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| duration_sec | 2487402.0 | 752.926556 | 785.662463 | 60.000000 | 372.000000 | 587.000000 | 906.000000 | 21582.000000 |
| start_station_id | 1921286.0 | 154.310646 | 130.504991 | 3.000000 | 50.000000 | 109.000000 | 246.000000 | 521.000000 |
| start_station_latitude | 2487402.0 | 37.757615 | 0.115452 | 36.163261 | 37.767079 | 37.778742 | 37.794130 | 37.895300 |
| start_station_longitude | 2487402.0 | -122.353212 | 0.143293 | -122.514299 | -122.416858 | -122.400239 | -122.390288 | -86.775177 |
| end_station_id | 1919693.0 | 148.668296 | 129.237512 | 3.000000 | 42.000000 | 104.000000 | 242.000000 | 521.000000 |
| end_station_latitude | 2487402.0 | 37.757775 | 0.115372 | 36.163140 | 37.767155 | 37.778768 | 37.794223 | 37.995942 |
| end_station_longitude | 2487402.0 | -122.352518 | 0.142755 | -122.575763 | -122.414817 | -122.399579 | -122.390288 | -86.775177 |
| bike_id | 2487402.0 | 151317.814366 | 260662.665328 | 12.000000 | 2383.000000 | 10172.000000 | 230062.000000 | 999960.000000 |
| distance_milles | 2487402.0 | 1.104625 | 0.717761 | 0.000038 | 0.589529 | 0.933539 | 1.450430 | 41.918678 |
The database is divided into 15 columns, which can be grouped into complementary subsets:
- Temporal data: 'duration_sec', 'start_time', 'end_time'
- Start and end points: 'start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude'
- Bike and user information: 'bike_id', 'user_type', 'bike_share_for_all_trip', 'rental_access_method'

### The main features of interest in the dataset
The main features of interest in the dataset are those related to the spatial and temporal information of the routes.
- On the one hand, the coordinates of the start and end stations of the routes.
- On the other hand, the temporal information of the routes.
Combining these data we can obtain a representation of the circulation patterns of the city of San Francisco.
At a second level, the duration of the journeys can tell us about the state of the roads. That is, if a route doubles its duration in a certain period of the day, either an obstacle has arisen or the road that serves it has become congested. In addition, information about the users could be useful to accurately target marketing or promotion campaigns that encourage the use of non-motorized vehicles within the city.
In this section, investigate distributions of individual variables.
- Quantitative variables: 'duration_sec', 'distance_milles'
- Qualitative variables: 'station_name', 'day_week', 'hour'
plt.figure(figsize=(16,9))
bins_edges = 10**np.arange(0, np.log10(df_bike['duration_sec']).max()+0.04,0.04)
plt.hist(data= df_bike, x='duration_sec', bins= bins_edges)
ticks=[100, 200, 400, 1000, 2000, 4000 ]
labels = ['{}'.format(i) for i in ticks]
plt.xscale('log')
plt.xlim(60, 5000)
plt.xticks(ticks, labels)
plt.xlabel('duration (in sec)')
plt.ylabel('number of trips')
plt.title('Distribution of trip durations');
Plotting the variable on a log scale reveals an approximately normal curve, with a peak of about 140,000 trips lasting between 8 and 13 minutes.
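The histogram above uses bin edges that are evenly spaced in log10 units, so every bar has the same width on a log-scaled axis. A minimal sketch of that construction without NumPy follows; the helper name `log_bin_edges` is hypothetical, and the 0.04 step mirrors the one used in the plotting code.

```python
import math


def log_bin_edges(max_value, step=0.04, start_exponent=0.0):
    """Return edges 10**start_exponent, 10**(start_exponent + step), ...

    The last edge is the first one that reaches or exceeds `max_value`,
    so the largest observation always falls inside a bin.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive for a log scale")
    edges = []
    k = 0
    while True:
        edge = 10 ** (start_exponent + k * step)
        edges.append(edge)
        if edge >= max_value:
            break
        k += 1
    return edges


# 21582 s is the maximum duration reported in the summary table
edges = log_bin_edges(21582)
print(len(edges), edges[0], edges[-1])
```

Each edge is the previous one multiplied by 10**0.04 (about 1.096), which is why the bins look equally wide once `plt.xscale('log')` is applied.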
plt.figure(figsize=(16,9))
bins_edges = 10**np.arange(0, np.log10(df_bike['distance_milles']).max()+0.04,0.04)
plt.hist(data=df_bike, x='distance_milles', bins=bins_edges)
ticks=[1, 1.5, 2, 3, 5, 10 ]
labels = ['{}'.format(i) for i in ticks]
plt.xscale('log')
plt.xlim(right=8)  # a log-scaled axis cannot start at 0, so only the upper limit is set
plt.xticks(ticks, labels)
plt.xlabel('distance (in miles)')
plt.ylabel('number of trips')
plt.title('Distribution of trip distances');
The distribution of the distance traveled by bike is as expected: a right-skewed distribution, with most trips concentrated at short distances and the peak below 1.5 miles.
By comparing the days of the week we can get a big picture of the purposes and types of trips. Let's check with the barchar function, using the time data extracted from the variables we corrected in prior steps.
barchar(df_bike, 'start_day')
barchar(df_bike, 'end_day')
In both cases, the number of trips drops during the weekend. Monday and Friday also show a slight decrease compared with the other weekdays. This may be due to the growth of working from home, or to trips to second residences in the suburbs right before the week starts or right after it ends.
Knowing the hour of the trips is a great advantage, not just for building a bike redistribution system but also for getting a general picture of traffic in the city.
barchar(df_bike, 'start_hour')
barchar(df_bike, 'end_hour')
We observe that about 40% of the trips happen during rush hour, between 8 and 10 in the morning and between 17 and 19 in the evening. This suggests that the service is widely used for daily commutes between home and work or school. It also suggests that San Francisco has a well-connected bike network that keeps trip durations stable and gives users reliable performance.
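The rush-hour share quoted above can be checked with a simple count. The sketch below is illustrative only: the helper `rush_hour_share` is hypothetical, and it assumes the `start_hour` values can be converted to integers and that "between 8 and 10" and "between 17 and 19" mean the hours 8, 9, 17 and 18.

```python
from collections import Counter


def rush_hour_share(start_hours, rush_hours=(8, 9, 17, 18)):
    """Fraction of trips whose start hour falls in `rush_hours`."""
    counts = Counter(int(hour) for hour in start_hours)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[hour] for hour in rush_hours) / total
```

Applied to the notebook's data, this would be called as `rush_hour_share(df_bike['start_hour'])`.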
The first spatial approach we can take is through the station names. Let's see which stations are the busiest.
# Use the 25 most frequent station names in the prior dataframe as a filter to focus on the top 25 stations.
top_25_stations = list(df_bike_stations.station_name.value_counts().head(25).index)
df_top_25 = df_bike_stations.loc[df_bike_stations.station_name.isin(top_25_stations)]
barchar(df_top_25, 'station_name', top_25_stations, 90)
These 25 stations are involved in more than a million trips over the last ten months. The number of bikes circulating among just these 25 stations is remarkable. With more than 80,000 trips, San Francisco Caltrain (Townsend St at 4th St) is the busiest station in the San Francisco Bay Area.
barchar(df_bike, 'user_type')
Consistent with the commuting pattern seen above, 70% of the trips are made by subscribers, that is, users who rely on this service on a daily basis.
The variables of interest, trip duration and distance traveled, have the expected distributions. The duration had to be plotted on a log scale to obtain an approximately normal distribution, with its peak between 8 and 10 minutes. The distance, in contrast, shows a right-skewed curve, and it also had to be log-scaled to obtain these results. This variable was computed with the distance function of geopy. With these variables we can understand not only the rhythm of the trips but also what type of routes users are taking.
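The code that computed `distance_milles` is not included in this section. As an illustration of the idea only, the sketch below estimates the straight-line distance between two stations with the haversine formula on a spherical Earth. This is a swapped-in approximation, not geopy's implementation, so results may differ slightly from the values in the dataframe; the function name and the radius constant are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

Because this measures the distance between stations rather than the path actually ridden, it should be read as a lower bound on the distance cycled.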
Regarding user types, we find a clear difference between the usage of subscribers and customers. We used the melt function to find the most popular stations and focused on the top 25, those that are most frequented and exchange the most bikes. Finally, the univariate time plots show, at different scales (day of the week and hour of the day), the temporal distribution of the use of the bicycle sharing service.
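The construction of `df_bike_stations` is not shown here. One plausible reading, stated as an assumption, is that the start and end station names were melted into a single `station_name` column so that each trip counts once for its origin and once for its destination. The hypothetical helper below sketches that counting with the standard library.

```python
from collections import Counter


def top_stations(trips, n=25):
    """Rank stations by how often they appear as origin or destination.

    `trips` is an iterable of dicts with 'start_station_name' and
    'end_station_name' keys (the column names used in the dataset).
    """
    counts = Counter()
    for trip in trips:
        counts[trip["start_station_name"]] += 1
        counts[trip["end_station_name"]] += 1
    return counts.most_common(n)
```

A round trip that starts and ends at the same station counts twice for that station under this definition.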
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
We can begin by studying how the use of the city's public bicycle service varies over time, using the records of the last 10 months (the year is incomplete because some files were in poor condition). For this we can build a time series of the distance traveled per month.
plt.figure(figsize=(16,9))
sns.lineplot(x='month_period', y="distance_milles", data=df_bike_period)
plt.title('Traveled distance by month')
plt.xlabel('months')
plt.ylabel('distance (in miles)');
With this curve we can see how current events have affected people's mobility. We refer to the impact of the pandemic as of February 2020. The steep increase in miles traveled was abruptly interrupted by COVID-19 and the resulting suspension of service: from 500,000 miles traveled per month to 0 miles in less than two months. In short, this curve shows the impact of the pandemic and the response of citizens more than the evolution of the use of the public bicycle network.
To relate distance and duration, let's use a scatter plot.
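The `df_bike_period` dataframe used for the monthly curve is built elsewhere in the notebook. A minimal sketch of the aggregation behind such a curve is given below; the helper `monthly_distance` is hypothetical and assumes timestamps are ISO-like strings that begin with `YYYY-MM`.

```python
from collections import defaultdict


def monthly_distance(trips):
    """Sum the distance traveled per calendar month.

    `trips` is an iterable of (start_time, distance_miles) pairs, where
    start_time is a string that begins with 'YYYY-MM' (an assumption).
    """
    totals = defaultdict(float)
    for start_time, distance_miles in trips:
        month = start_time[:7]  # keep only the 'YYYY-MM' prefix
        totals[month] += distance_miles
    return dict(sorted(totals.items()))
```

Sorting the keys works here because `YYYY-MM` strings order chronologically.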
plt.figure(figsize=(16,9))
plt.scatter(data=df_bike, x='distance_milles', y='duration_sec', alpha=1/10)
plt.title('travel duration by distance')
plt.xlabel('distance (in miles)')
plt.ylabel('duration (in sec)')
plt.xlim(0, 10)
plt.ylim(0, 10000);
plt.figure(figsize=(16,9))
bins_x = np.arange(0.5, 8+0.25, 0.25)
bins_y = np.arange(-0.5, 10600+200, 200)
h2d = plt.hist2d(data = df_bike, x = 'distance_milles', y = 'duration_sec',
bins = [bins_x, bins_y], cmap = 'viridis_r', cmin = 800)
plt.title('travel duration by distance')
plt.xlabel('distance (in miles)')
plt.ylabel('duration (in sec)')
plt.xlim(0,5)
plt.ylim(0,3000)
plt.colorbar()
counts = h2d[0]
# loop through the cell counts and add text annotations for each
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= 15000: # increase visibility on darkest cells
            plt.text(bins_x[i]+0.12, bins_y[j]+80, int(c),
                     ha = 'center', va = 'center', color = 'white')
        elif c > 0:
            plt.text(bins_x[i]+0.13, bins_y[j]+70, int(c),
                     ha = 'center', va = 'center', color = 'black')
#set bins edges, compute center
plt.figure(figsize=(16,9))
bin_size = 0.5
xbin_edges = np.arange(0, 8+bin_size, bin_size)
xbin_centers = (xbin_edges + bin_size/2)[:-1]
#compute statistics in each bin
data_xbins = pd.cut(df_bike['distance_milles'], xbin_edges, right= False, include_lowest=True)
y_means= df_bike['duration_sec'].groupby(data_xbins).mean()
y_sems = df_bike['duration_sec'].groupby(data_xbins).sem()
#plot the summarized data
plt.errorbar(x = xbin_centers, y = y_means, yerr = y_sems)
plt.title('travel duration by distance')
plt.xlabel('distance in miles')
plt.ylabel('duration in sec');
Despite the noise of the scatterplot, both the heatmap and the line plot confirm a strong correlation between the distance and the duration of the journey for the first few miles. From the sixth mile on, we observe a reduction in the speed of the routes. This may be due to commuting or to poor infrastructure for non-motorized vehicles. The fact that the curve drops again may indicate inter-urban routes with fewer interruptions, such as traffic lights or roundabouts, which shorten travel times between points within the city.
From the previous plots, we can derive the average cycling speed in the San Francisco Bay Area. An interesting analysis is to look at how this variable oscillates throughout the hours of the day. A reduction in speed will indicate the most congested moments of the day, as well as which hours are the most suitable for using the bicycle.
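The `speed_mph` column used in the plots below is not defined in this section. Presumably it is the distance divided by the duration expressed in hours; a minimal sketch under that assumption, with a hypothetical function name:

```python
def speed_mph(distance_miles, duration_sec):
    """Average speed in miles per hour for a single trip."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    # 3600 seconds per hour converts miles per second into miles per hour
    return distance_miles * 3600 / duration_sec
```

If the distance is the straight-line distance between stations, this speed underestimates how fast the rider actually moved, and it also includes any time spent stopped during the rental.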
barchar(df_bike, 'start_hour', None, None, 'user_type')
barchar(df_bike, 'end_hour', None, None, 'user_type')
The relationship between subscribers and customers changes throughout the day. During rush hours, subscribers make almost three times as many trips as customers; in the intermediate period this difference is considerably smaller. For both groups, the minimum values occur at night.
plt.figure(figsize=(16,12))
# subplot 1: distance vs hour
plt.subplot(3, 1, 1)
base_color = sns.color_palette()[0]
sns.boxplot(data = df_bike, x = 'start_hour', y = 'distance_milles', color = base_color)
plt.ylim(0,6)
plt.title('distance by hour of the day')
plt.ylabel('distance (in miles)')
plt.xlabel('hour of the day');
# subplot 2: duration vs hour
plt.subplot(3, 1, 2)
base_color = sns.color_palette()[1]
sns.boxplot(data = df_bike, x = 'start_hour', y = 'duration_sec', color = base_color)
plt.ylim(0,3000)
plt.title('duration by hour of the day')
plt.ylabel('duration (in sec)')
plt.xlabel('hour of the day');
# subplot 3: speed vs. hour
plt.subplot(3, 1, 3)
base_color = sns.color_palette()[2]
sns.boxplot(data = df_bike, x = 'start_hour', y = 'speed_mph', color = base_color)
plt.title('Speed by hour of the day')
plt.ylabel('speed (in MPH)')
plt.xlabel('hour of the day');
The average distance traveled in the city is about one mile. The public bike network is clearly used as an urban vehicle for short trips; we can speak of last-mile transport.
The trip durations are consistent with this, with medians of less than 8 minutes.
The highest speeds occur at 5 a.m., but in general the speed stays roughly constant between 6 and 7 mph. This suggests that rush-hour traffic congestion does not affect the urban cycling network.
Comparing customer use with subscriber use, we can study whether the tourist and local circuits overlap at the stations. An increase in the proportion of customers at a station would indicate this. At the ferry station on the bay the two values come closer together, as they do at Embarcadero. Seeing them over time could tell us whether this overlap is compromising the availability of bicycles.
barchar(df_top_25, 'station_name', top_25_stations, 90, 'user_type')
Using the georeferenced data we can visualize where the stations are. The heatmap shows the intensity of bicycle use as a range of colors, and the folium library lets us explore it interactively on a map.
mapa = folium.Map(location=[37.8068, -122.3807],
zoom_start=11.5,
tiles='cartodbpositron')
heat_df = df_bike.sample(40000)
heat_df = heat_df[['start_station_latitude', 'start_station_longitude']]
heat_data = [[row['start_station_latitude'], row['start_station_longitude']] for index, row in heat_df.iterrows()]
mapa.add_child(HeatMap(heat_data))
mapa
Most of the trips are concentrated in the rush hours, when most of the city's transport routes are saturated. Let us look at which routes predominate from 8 am to 9 am and from 5 pm to 6 pm.
m = folium.Map(location=[37.8068, -122.3807],
zoom_start=11.5,
tiles='cartodbpositron')
df_morning = df_bike[df_bike.start_hour=='08']
for index, row in df_morning.sample(500).iterrows():
    # Teal marker: station where the trip starts
    folium.CircleMarker(location=[row['start_station_latitude'], row['start_station_longitude']],
                        #radius=radi,
                        color="#0A8A9F",
                        popup='start station',
                        fill=True).add_to(m)
    # Orange marker: station where the trip ends
    folium.CircleMarker(location=[row['end_station_latitude'], row['end_station_longitude']],
                        #radius=radi,
                        color="#E37222",
                        popup='end station',
                        fill=True).add_to(m)
m
m = folium.Map(location=[37.8068, -122.3807],
zoom_start=11.5,
tiles='cartodbpositron')
df_evening = df_bike[df_bike.start_hour=='17']
for index, row in df_evening.sample(500).iterrows():
    # Teal marker: station where the trip starts
    folium.CircleMarker(location=[row['start_station_latitude'], row['start_station_longitude']],
                        #radius=radi,
                        color="#0A8A9F",
                        popup='start station',
                        fill=True).add_to(m)
    # Orange marker: station where the trip ends
    folium.CircleMarker(location=[row['end_station_latitude'], row['end_station_longitude']],
                        #radius=radi,
                        color="#E37222",
                        popup='end station',
                        fill=True).add_to(m)
m
Comparing both maps we can see how the role of the stations is inverted: those that are mostly starting points in the morning are mostly destinations in the evening. In San Francisco these movements go from the interior to the east coast in the morning and reverse in the afternoon.
Despite the noise, we can see a strong positive correlation between the distance and the duration of the trips. Throughout the day, the distance, duration and speed variables seem stable.
As we have seen, displacements are distributed within 3 areas.
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
plt.figure(figsize=(16,12))
# subplot 1: distance vs hour
plt.subplot(3, 1, 1)
base_color = sns.color_palette()[0]
sns.pointplot(data = df_bike, x = 'start_hour', y = 'distance_milles', hue='user_type',
color = base_color,
markers=["o", "x"],
linestyles=["-", "--"])
plt.title('distance by hour of the day')
plt.ylabel('distance (in miles)')
plt.xlabel('hour of the day');
# subplot 2: duration vs hour
plt.subplot(3, 1, 2)
base_color = sns.color_palette()[1]
sns.pointplot(data = df_bike, x = 'start_hour', y = 'duration_sec', hue='user_type',
color = base_color,
markers=["o", "x"],
linestyles=["-", "--"])
plt.title('duration by hour of the day')
plt.ylabel('duration (in sec)')
plt.xlabel('hour of the day');
# subplot 3: speed vs. hour
plt.subplot(3, 1, 3)
base_color = sns.color_palette()[2]
sns.pointplot(data = df_bike, x = 'start_hour', y = 'speed_mph', hue='user_type',
color = base_color,
markers=["o", "x"],
linestyles=["-", "--"])
plt.title('Speed by hour of the day')
plt.ylabel('speed (in MPH)')
plt.xlabel('hour of the day');
Both the distance and the duration of customers' trips are greater than those of subscribers. This may be because the bike is not their main means of transport, or because they use it recreationally. The speed, however, is higher for subscribers.
Around 5 in the morning the average distance traveled by subscribers exceeds that of customers; after 9 in the morning, however, it drops much more abruptly.
Regarding duration, we see a considerable increase between 12 and 16 hours among customers; these values may indicate tourist or recreational trips.
The speeds of the two groups run roughly in parallel, with a peak of 7.5 mph on average at 5 in the morning.
plt.figure(figsize=(16,9))
# `height` replaces the `size` argument, which newer seaborn versions no longer accept
g = sns.FacetGrid(data = df_bike, col = 'user_type', col_wrap = 2, height = 6, margin_titles=True)
g.map(hist2dgrid, 'distance_milles', 'duration_sec', color = 'inferno_r')
g.set(xlim=(0, 4))
g.set(ylim=(0, 3000))
g.set_xlabels('distance')
g.set_ylabels('duration')
After presenting the busiest stations in the city, we want to see how these points behave throughout the day and by user type.
This series of bar plots records, for each station, the number of trips that depart from it and the number that arrive at it throughout the day.
top_10_stations = list(df_bike_stations.station_name.value_counts().head(10).index)
for station in top_10_stations:
    # Trips that start at, and trips that end at, the current station
    df_start_station = df_bike.loc[df_bike['start_station_name']== station]
    df_end_station = df_bike.loc[df_bike['end_station_name']== station]
    plt.figure(figsize=(16,9))
    # Top panel: departures by hour
    plt.subplot(2, 1, 1)
    base_color = sns.color_palette()[0]
    sns.countplot(data = df_start_station, x = 'start_hour', hue = 'user_type', color = base_color)
    plt.title('Number of trips with {} as starting point by hour'.format(station))
    plt.ylabel('number of trips')
    plt.xlabel('hour of the day')
    # Bottom panel: arrivals by hour
    plt.subplot(2, 1, 2)
    base_color = sns.color_palette()[1]
    sns.countplot(data = df_end_station, x = 'end_hour', hue = 'user_type', color = base_color)
    plt.title('Number of trips with {} as destination point by hour'.format(station))
    plt.ylabel('number of trips')
    plt.xlabel('hour of the day')
First, we note that customers fluctuate less throughout the day. The interesting thing about these plots is how the subscriber peaks mirror each other: when a station has many departures in the morning, it receives the corresponding returns in the afternoon. Conversely, stations in work areas receive many arrivals in the morning and many departures in the afternoon. This is a good indication of how land uses are distributed, and it also helps maintain a balanced distribution of bicycles within the network.
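The mirror pattern can be made explicit by computing, for one station, arrivals minus departures for each hour. The sketch below is a hypothetical illustration (the helper `net_flow_by_hour` is not part of the notebook) that assumes the column names of `df_bike` and hour values convertible to integers.

```python
from collections import defaultdict


def net_flow_by_hour(trips, station):
    """Arrivals minus departures at `station` for each hour of the day.

    Positive values mean bikes accumulate at the station during that hour;
    negative values mean the station is being emptied.
    """
    flow = defaultdict(int)
    for trip in trips:
        if trip["start_station_name"] == station:
            flow[int(trip["start_hour"])] -= 1
        if trip["end_station_name"] == station:
            flow[int(trip["end_hour"])] += 1
    return dict(sorted(flow.items()))
```

A station in a work area would show positive values in the morning and negative values in the late afternoon, and a residential station would show the opposite, which is the kind of signal a bike redistribution plan could use.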
A heatmap with a time series can be a great resource for showing what we described above. We can build one using the folium library.
mapa = folium.Map(location=[37.8068, -122.3807],
zoom_start=11.5,
tiles='cartodbpositron')
heat_df = df_bike.sample(45000)
heat_df = heat_df[['start_station_latitude', 'start_station_longitude', 'start_hour']]
day_hour = df_bike['start_hour'].sort_values().unique()
heat_data = [[[row['start_station_latitude'], row['start_station_longitude']] for index, row in heat_df[heat_df['start_hour']==i].iterrows()] for i in day_hour ]
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(mapa)
# Display the map
mapa
mapa = folium.Map(location=[37.8068, -122.3807],
zoom_start=11.5,
tiles='cartodbpositron')
heat_df = df_bike.sample(45000)
heat_df = heat_df[['end_station_latitude', 'end_station_longitude', 'end_hour']]
day_hour = df_bike['end_hour'].sort_values().unique()
heat_data = [[[row['end_station_latitude'], row['end_station_longitude']] for index, row in heat_df[heat_df['end_hour']==i].iterrows()] for i in day_hour ]
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(mapa)
# Display the map
mapa
We observed how the speed, distance and duration of the trips vary throughout the day according to the type of user. We also mapped the distribution of the trips and distinguished the departure points from the arrival points. With these steps we recognized the pendular movement of the population between certain stations and the average values of these routes.
It has been interesting to see how bike-sharing use in San Francisco reflects circulation in the city and how it is distributed within 3 sectors. The behavior of the different user types is also a useful indicator. What stands out most, however, is the possibility of using this data to guide how bikes are redistributed between stations.